SWOOP: Top-k Similarity Joins over Set Streams
نویسندگان
چکیده
We provide efficient support for applications that aim to continuously find pairs of similar sets in rapid streams of sets. A prototypical example setting is that of tweets. A tweet is a set of words, and Twitter emits about half a billion tweets per day. Our solution makes it possible to efficiently maintain the top-k most similar tweets from a pair of rapid Twitter streams, e.g., to discover similar trends in two cities if the streams concern cities. Using a sliding window model, the top-k result changes as new sets in the stream enter the window or existing ones leave the window. Maintaining the top-k result under rapid streams is challenging. First, when a set arrives, it may form a new pair for the top-k result with any set already in the window. Second, when a set leaves the window, all its pairings in the top-k are invalidated and must be replaced. It is not enough to maintain the k most similar pairs, as less similar pairs may eventually be promoted to the top-k result. A straightforward solution that pairs every new set with all sets in the window and keeps all pairs for maintaining the top-k result is memory intensive and too slow. We propose SWOOP, a highly scalable stream join algorithm that solves these issues. Novel indexing techniques and sophisticated filters efficiently prune useless pairs as new sets enter the window. SWOOP incrementally maintains a stock of similar pairs to update the top-k result at any time, and the stock is shown to be minimal. Our experiments confirm that SWOOP can deal with stream rates that are orders of magnitude faster than the rates of existing approaches.
منابع مشابه
Top-k Similarity Join over Multi-valued Objects
The top-k similarity joins have been extensively studied and used in a wide spectrum of applications such as information retrieval, decision making, spatial data analysis and data mining. Given two sets of objects U and V, a top-k similarity join returns k pairs of most similar objects from U×V. In the conventional model of top-k similarity join processing, an object is usually regarded as a po...
متن کاملSolving similarity joins and range queries in metric spaces with the list of twin clusters
The metric space model abstracts many proximity or similarity problems, where the most frequently considered primitives are range and k-nearest neighbor search, leaving out the similarity join, an extremely important primitive. In fact, despite the great attention that this primitive has received in traditional and even multidimensional databases, little has been done for general metric databas...
متن کاملAlgorithms for continuous top-k processing in social networks
Information streams are mainly produced today by social networks, but current methods for continuous top-k processing of such streams are still limited to contentbased similarity. We present the SANTA algorithm, able to handle also social network criteria and events, and report a preliminary comparison with an extension of a state-of-the-art algorithm.
متن کاملEnsemble-based Top-k Recommender System Considering Incomplete Data
Recommender systems have been widely used in e-commerce applications. They are a subclass of information filtering system, used to either predict whether a user will prefer an item (prediction problem) or identify a set of k items that will be user-interest (Top-k recommendation problem). Demanding sufficient ratings to make robust predictions and suggesting qualified recommendations are two si...
متن کاملOptimizing Multiple Top-K Queries over Joins
Advanced Data Mining applications require more and more support from relational database engines. Especially clustering applications in high dimensional features space demand a proper support of multiple Top-k queries in order to perform projected clustering. Although some research tackles to problem of optimizing restricted ranking (top-k) queries, there is no solution considering more than on...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1711.02476 شماره
صفحات -
تاریخ انتشار 2017